1. Introduction

In the realm of predictive policing, the quest to enhance public safety through data-driven methodologies has gained significant momentum. By harnessing advancements in data analytics and machine learning, law enforcement agencies aim to proactively pinpoint areas at an elevated risk of criminal activity, enabling more strategic resource allocation. Yet, the critical challenge lies in overcoming biases within the data and models to ensure outcomes that are equitable and just across all communities.

In confronting this challenge, our project delves into the complex issue of urban crime in Chicago, striving to develop a predictive model that leverages data on abandoned buildings and streetlights—a reflection of the city’s battle with urban decay and segregation’s legacy. The backdrop of Chicago’s urban landscape, marked by vacant and dilapidated properties resulting from decades of discriminatory practices like redlining, sets the stage for our investigation (WTTW News, 2022). This historical context, as underscored by recent studies, reveals a deep-seated pattern of neglect that has not only fueled poverty and population loss but has also become a fertile ground for criminal activities, including a surge in robberies that peaked in 2023, marking the most challenging period in over a decade (Illinois Policy, 2024).

Our model is designed to predict robbery occurrences, with a keen eye on avoiding selection bias that could perpetuate systemic inequalities. Through iterative refinements and the inclusion of innovative predictors, we aim to enhance both accuracy and generalizability, directly addressing the biases ingrained in the data.

This endeavor navigates the intricacies of predictive policing, engaging with profound questions of fairness, transparency, and ethical considerations. By thoroughly explaining our model’s construction and rigorously assessing its performance, we aim to shed light on the potential benefits and drawbacks of implementing such algorithms in practice. Through this exploration, our goal is to contribute thoughtfully to the dialogue on the responsible use of predictive policing technologies, always mindful of their wider societal implications.

# Read and process police districts data
policeDistricts <- 
  st_read("https://data.cityofchicago.org/api/geospatial/fthy-xz3r?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%  # Transform coordinate reference system
  dplyr::select(District = dist_num)  # Select only the district number, renaming it to 'District'

# Read and process police beats data
policeBeats <- 
  st_read("https://data.cityofchicago.org/api/geospatial/aerh-rz74?method=export&format=GeoJSON") %>%
  st_transform('ESRI:102271') %>%  # Transform coordinate reference system
  dplyr::select(District = beat_num)  # Select only the beat number, renaming it to 'District'

# Combine police districts and beats data into one dataframe
bothPoliceUnits <- rbind(
  mutate(policeDistricts, Legend = "Police Districts"),  # Add a 'Legend' column and label for police districts
  mutate(policeBeats, Legend = "Police Beats")  # Add a 'Legend' column and label for police beats
)

# Read and process shotspotter
# shotspotter <- 
#   st_read("https://data.cityofchicago.org/resource/3h7q-7mdb.geojson") %>%
#   st_transform('ESRI:102271')

# Read crime data 2022
crime_2022 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2022/9hwr-2zxp")

# Robbery ranks among the top 10 most frequent crime types in Chicago (#8)
crime_summary <- crime_2022 %>%
  group_by(Primary.Type) %>%
  summarise(total_crimes = n(), .groups = "drop")

# Armed - handgun is the most frequent robbery description
crime_ROB_HANDGUN <- crime_2022 %>%
  filter(Primary.Type == "ROBBERY") %>%
  group_by(Description) %>%
  summarise(total_crimes = n(), .groups = "drop")

# Read and process robberies with handgun data
robbery <- 
  crime_2022 %>% 
  filter(Primary.Type == "ROBBERY" & Description == "ARMED - HANDGUN") %>%  # Filter robbery data
  mutate(x = gsub("[()]", "", Location)) %>%  # Clean location data
  separate(x, into = c("Y", "X"), sep = ",") %>%  # Separate X and Y coordinates
  mutate(X = as.numeric(X), Y = as.numeric(Y)) %>%  # Convert coordinates to numeric
  na.omit() %>%  # Remove rows with missing values
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%  # Convert to sf object with specified CRS
  st_transform('ESRI:102271') %>%  # Transform coordinate reference system
  distinct()  # Keep only distinct geometries

# Read and process Chicago boundary data
chicagoBoundary <- 
  st_read(file.path(root.dir, "/Chapter5/chicagoBoundary.geojson")) %>%  # Read Chicago boundary data
  st_transform('ESRI:102271')  # Transform coordinate reference system

2. Exploratory Data Analysis

The analysis of armed handgun robberies in Chicago for 2022 is depicted through two maps, each offering unique insights. Figure 2.1 shows the point-level distribution of incidents citywide, highlighting the widespread nature of the issue and the need for broad-based interventions. On the other hand, Figure 2.2 identifies hotspots in the West, Southwest, and South Districts, enabling targeted resource allocation and strategic planning. However, the hotspot view risks oversimplifying the problem and neglecting other areas in need of attention. A balanced approach, combining citywide vigilance with targeted interventions, is essential for effective crime prevention and response.

The point-based map details widespread armed handgun robbery incidents and argues for comprehensive interventions, while the density-based map targets concentrated crime areas at the risk of oversimplification; together they underscore the need for a balanced approach.

# A map of your outcome of interest in point form, with some description of what, when, and why you think selection bias may be an issue.

# Uses grid.arrange to organize independent plots
grid.arrange(
  ncol = 2,
  
  # Plot 1: Robberies overlaid on Chicago boundary
  ggplot() + 
    geom_sf(data = chicagoBoundary) +  # Add Chicago boundary
    geom_sf(data = robbery, colour = "#F21A00", size = 0.1, show.legend = "point") +  # Overlay robberies
    labs(title = "Robberies, Chicago - 2022", caption = "Figure 2.1") +  # Set plot title
    theme_void(),  # Use a blank theme
  
  # Plot 2: Density of robberies with contours overlaid on Chicago boundary
  ggplot() + 
    geom_sf(data = chicagoBoundary, fill = "grey40") +  # Add Chicago boundary with grey fill
    stat_density2d(data = data.frame(st_coordinates(robbery)),  # Compute 2D kernel density estimate
                   aes(X, Y, fill = after_stat(level), alpha = after_stat(level)),  # Define aesthetics for density contours
                   size = 0.01, bins = 40, geom = 'polygon') +  # Set size and number of bins for contours
    scale_fill_viridis() +  # Use Viridis color scale for fill
    scale_alpha(range = c(0.00, 0.35), guide = "none") +  # Set transparency range for contours
    labs(title = "Density of Robberies", caption = "Figure 2.2") +  # Set plot title
    theme_void() + theme(legend.position = "none")  # Use a blank theme and remove legend
)

As another approach, aggregating incidents to a fishnet grid offers a more nuanced view of how robberies are distributed across space. Using a fishnet with 500-meter cells, each cell records the count of incidents it contains, ranging from 0 to 20, enabling a detailed examination of localized crime patterns. This method is instrumental in depicting the spatial structure of incident density and serves as a valuable tool for analyzing localized crime trends with precision. (See Figure 2.3)

## using {sf} to create the grid
## Note the `.[chicagoBoundary] %>% ` line. This is needed to clip the grid to our data
fishnet <- 
  st_make_grid(chicagoBoundary,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary] %>%            # fast way to select intersecting polygons
  st_sf() %>%
  mutate(uniqueID = 1:n())

## add a value of 1 to each crime, sum them with aggregate
crime_net <- 
  dplyr::select(robbery) %>% 
  mutate(countRobberies = 1) %>% 
  aggregate(., fishnet, sum) %>%
  mutate(countRobberies = replace_na(countRobberies, 0),
         uniqueID = 1:n(),
         cvID = sample(round(nrow(fishnet) / 24), 
                       size=nrow(fishnet), replace = TRUE))

ggplot() +
  geom_sf(data = crime_net, aes(fill = countRobberies), color = NA) +
  scale_fill_viridis("Count of Robberies") +
  labs(title = "Count of Robberies for the fishnet", caption = "Figure 2.3") +
  theme_void()

3. Modeling

In the modeling section, we approach geospatial risk modeling as a regression model, aiming to predict discrete events such as crime occurrences, fires, successful sales calls, or even the locations of establishments like donut shops. These predictions serve as estimates of the likelihood of the event happening at particular geographical locations.
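A minimal sketch of this idea with hypothetical data: event counts per grid cell regressed on a single risk factor with a Poisson GLM, the same count-regression family used for the fishnet model below. The variable names here are illustrative, not from the project data.

```r
# Hypothetical illustration: Poisson count regression on grid cells
set.seed(1)
n_cells <- 200
risk_factor <- runif(n_cells, 0, 10)     # e.g., nearby abandoned buildings (made up)
lambda <- exp(-1 + 0.25 * risk_factor)   # true underlying event rate
counts <- rpois(n_cells, lambda)         # observed event counts per cell

fit <- glm(counts ~ risk_factor, family = poisson)

# predicted counts estimate how likely an event is in each cell
preds <- predict(fit, type = "response")
```

The predictions are strictly positive expected counts, which is why a Poisson (rather than Gaussian) specification suits discrete events aggregated to cells.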

Focusing on crime prediction, we operate under the premise that the risk of crime is shaped by a multitude of geospatial factors, ranging from those that amplify risk to those that provide protective measures. In our study, we opt to incorporate two primary features indicative of urban decay: the presence of vacant buildings and areas characterized by lights being switched off. Visualizing the spatial distribution of these features, as illustrated in Figure 3.1, is crucial. We observe that while the distribution of areas with lights-off appears relatively uniform across the city, abandoned buildings tend to be more concentrated in specific regions. These non-uniform distributions inherently introduce biases into the analysis.

This bias could potentially lead to unequal resource allocation, with certain areas receiving more frequent patrols or heightened policing efforts while others are neglected. Such disparities in resource allocation risk exacerbating existing imbalances, ultimately perpetuating inequities in urban communities.

## only pulling a single variable for our model to keep it simple
## using Socrata again
# Read the abandoned buildings data in the City of Chicago
abandonBuildings <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Vacant-and-Abandoned-Building/7nii-7srd") %>%
    mutate(year = substr(date_service_request_was_received,1,4)) %>%  filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Abandoned_Buildings")

# Read the street lights out data in the City of Chicago
streetLightsOut <- 
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Street-Lights-All-Out/zuxi-7xem") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2017") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
    st_transform(st_crs(fishnet)) %>%
    mutate(Legend = "Street_Lights_Out")

# Read neighborhood boundaries for Chicago and transform to match fishnet CRS
neighborhoods <- 
  st_read("https://raw.githubusercontent.com/blackmad/neighborhoods/master/chicago.geojson") %>%
  st_transform(st_crs(fishnet))

# Plot abandoned buildings and street lights out on Chicago neighborhoods
ggplot() + 
  geom_sf(data = neighborhoods) +  # Add neighborhood boundaries
  geom_sf(data = streetLightsOut, colour = "#E1AF00", size = 0.1, show.legend = "point") +  # Overlay street lights out
  geom_sf(data = abandonBuildings, colour = "#3B9AB2", size = 0.1, show.legend = "point") +  # Overlay abandoned buildings
  labs(title = "Abandoned Buildings and Street Lights Out, Chicago - 2017", caption = "Figure 3.1") +  # Set plot title
  theme_void()  # Use a blank theme

The underlying assumption is that as exposure to these risk-inducing factors increases, so does the likelihood of crime occurring. Conversely, the presence of protective factors, like well-lit areas, may mitigate the risk of crime.

However, these two variables can be problematic; we’ve identified four main biases:

  1. Spatial Bias: The distribution of vacant buildings and areas with lights-off may not be uniform across different neighborhoods or regions. Certain areas, particularly those with lower socioeconomic status or marginalized communities, may have a higher concentration of vacant buildings or areas with lights-off.
  2. Socioeconomic Bias: Vacant buildings and areas with lights-off are often associated with neighborhoods experiencing economic decline, poverty, or disinvestment. By incorporating these variables into the model, there’s a risk of reinforcing existing socioeconomic biases.
  3. Temporal Bias: The presence of vacant buildings and lights-off can vary over time due to factors such as urban development, infrastructure changes, or seasonal fluctuations. Failing to account for temporal variations in these variables may result in biased predictions.
  4. Data Quality Bias: The accuracy and completeness of data regarding vacant buildings and lights-off can vary. Inaccuracies or missing information in these datasets can lead to biased model predictions.
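One quick diagnostic for the spatial bias described in item 1 is to measure how concentrated a risk factor is across grid cells, for instance the share of cells that accounts for half of all reported incidents. Below is a base-R sketch with hypothetical per-cell counts; with the real data, the per-cell counts from vars_net would be used instead.

```r
# Hypothetical uneven counts of a risk factor per fishnet cell
set.seed(2)
cell_counts <- rpois(300, lambda = rexp(300, 1/3))

# Share of cells needed to account for half of all incidents;
# small values indicate strong spatial concentration (potential spatial bias)
concentration <- function(x) {
  x <- sort(x, decreasing = TRUE)
  cum_share <- cumsum(x) / sum(x)
  which(cum_share >= 0.5)[1] / length(x)
}

concentration(cell_counts)
```

A perfectly uniform factor would return 0.5; values well below that signal the kind of clustering we observe for abandoned buildings.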

# Join the abandoned buildings and street lights out data with the fishnet grid based on spatial intersection
vars_net <- 
  rbind(abandonBuildings, streetLightsOut) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    dplyr::select(-`<NA>`) %>%
    na.omit() %>%
    ungroup()

# Reshape vars_net to long format for small-multiple mapping
vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      mapTheme()}

do.call(grid.arrange,c(mapList, ncol=3, top="Risk Factors by Fishnet"))

# Convenience aliases to shorten the spatial function names
st_c <- st_coordinates
st_coid <- st_centroid

vars_net <-
  vars_net %>%
    mutate(
      Abandoned_Buildings.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(abandonBuildings),3),
      Street_Lights_Out.nn =
        nn_function(st_c(st_coid(vars_net)), st_c(streetLightsOut),3))

## Visualize the NN feature
vars_net.long.nn <- 
  dplyr::select(vars_net, ends_with(".nn")) %>%
    gather(Variable, value, -geometry)

## important to drop the geometry from joining features
final_net <-
  left_join(crime_net, st_drop_geometry(vars_net), by="uniqueID") 

final_net <-
  st_centroid(final_net) %>%
    st_join(dplyr::select(neighborhoods, name)) %>%
    st_join(dplyr::select(policeDistricts, District)) %>%
      st_drop_geometry() %>%
      left_join(dplyr::select(final_net, geometry, uniqueID)) %>%
      st_sf() %>%
  na.omit()

Through the Local Moran’s I model, we unearthed a noteworthy pattern: robbery occurrences exhibit concentration within specific central zones, aligning with areas characterized by significant hotspots and heightened local Moran’s I values. Intriguingly, the distribution of p-values remains relatively uniform, underscoring the model’s overall significance.

In essence, a significant hotspot within the robbery data denotes regions where a pronounced clustering of high robbery activity occurs in comparison to their surroundings. This clustering phenomenon implies that certain locales experience a disproportionately higher frequency of robberies than what random chance would predict alone.
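The statistic behind these maps can be sketched in a few lines of base R. On a toy one-dimensional strip of cells with an obvious high-value cluster, the local Moran's I for cell i is z_i times the weighted average of its neighbors' deviations, scaled by the overall variance; the values below are illustrative, not project data.

```r
# Toy illustration of local Moran's I: I_i = z_i * sum_j(w_ij * z_j) / (sum(z^2) / n)
vals <- c(1, 2, 1, 9, 10, 9, 1, 2, 1)  # a clear high-value cluster in cells 4-6
n <- length(vals)
z <- vals - mean(vals)                 # deviations from the mean

# rook-style neighbors along the strip, row-standardized weights
W <- matrix(0, n, n)
for (i in 1:n) {
  nb <- c(i - 1, i + 1)
  nb <- nb[nb >= 1 & nb <= n]
  W[i, nb] <- 1 / length(nb)
}

m2 <- sum(z^2) / n
Ii <- as.vector(z * (W %*% z) / m2)    # local Moran's I per cell

round(Ii, 2)
```

The center of the high-value run receives the largest positive I_i (a high-high cell), while cells where a low value sits beside high neighbors turn negative, which is exactly the clustering signal spdep's localmoran() reports above.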

## generates warnings from PROJ issues
## {spdep} to make polygon to neighborhoods... 
final_net.nb <- poly2nb(as_Spatial(final_net), queen=TRUE)
## ... and neighborhoods to list of weights
final_net.weights <- nb2listw(final_net.nb, style="W", zero.policy=TRUE)

# print(final_net.weights, zero.policy=TRUE)
final.net_df <- as.data.frame(final_net)

## see ?localmoran
local_morans <- localmoran(final_net$countRobberies, final_net.weights, zero.policy=TRUE) %>% 
  as.data.frame()

# join local Moran's I results to fishnet
final_net.localMorans <- 
  cbind(local_morans, as.data.frame(final_net)) %>% 
  st_sf() %>%
  dplyr::select(Robbery_Count = countRobberies, 
                Local_Morans_I = Ii, 
                P_Value = `Pr(z != E(Ii))`) %>%
  mutate(Significant_Hotspots = ifelse((P_Value <= 0.05), 1, 0)) %>%
  gather(Variable, Value, -geometry)
  
## This is just for plotting
vars <- unique(final_net.localMorans$Variable)
varList <- list()

for(i in vars){
  varList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(final_net.localMorans, Variable == i), 
              aes(fill = Value), colour=NA) +
      scale_fill_viridis(name="") +
      labs(title=i) +
      theme_void() + theme(legend.position="bottom")}

do.call(grid.arrange,c(varList, ncol = 4, top = "Local Morans I statistics, Robbery"))

The scatterplot analysis revealed a positive correlation between the spatially lagged and non-spatial variables. Concurrently, the histogram demonstrated a right-skewed distribution, suggesting that the majority of the data points exhibit lower counts of robberies, with only a small portion displaying notably higher counts.

lmoran <- localmoran(final_net$countRobberies, final_net.weights, zero.policy=TRUE)

final_net$lmI <- lmoran[, "Ii"] # local Moran's I
final_net$lmZ <- lmoran[, "Z.Ii"] # z-scores
final_net$lmp <- lmoran[, "Pr(z != E(Ii))"]


mp <- moran.plot(as.vector(scale(final_net$countRobberies)), final_net.weights, zero.policy = TRUE)

##Create a hotspot variable:
final_net$hotspot <- 0
# high-high
final_net[(mp$x >= 0 & mp$wx >= 0) & (final_net$lmp <= 0.05), "hotspot"] <- 1


## 6. A histogram of the dependent variable
ggplot(final_net) +
  geom_histogram(aes(x = countRobberies), binwidth = 0.5, fill = "#78B7C5", color = "#3B9AB2") +
  labs(title = "Histogram of Robberies", x = "Values", y = "Frequency", caption = "Figure 3.2")

# generates warning from NN
final_net <- final_net %>% 
  mutate(countRobberies.isSig.dist = 
           nn_function(st_c(st_coid(final_net)),
                       st_c(st_coid(filter(final_net, 
                                           hotspot == 1))), 
                       k = 1))

# A small multiple map of model errors by random k-fold and spatial cross validation.
# A table of MAE and standard deviation MAE by regression.
# A table of raw errors by race context for a random k-fold vs. spatial cross validation regression.
  
# View(crossValidate)

## define the variables we want
reg.ss.vars <- c("Street_Lights_Out.nn", "Abandoned_Buildings.nn")


## RUN REGRESSIONS
reg.ss.spatialCV <- crossValidate(
  dataset = final_net,
  id = "name",                           
  dependentVariable = "countRobberies",
  indVariables = reg.ss.vars) %>%
    dplyr::select(cvID = name, countRobberies, Prediction, geometry)

reg.summary <- 
  rbind(
    mutate(reg.ss.spatialCV, Error = Prediction - countRobberies,
                             Regression = "Spatial LOGO-CV: Spatial Process")) %>% 
    st_sf() 

Figure 3.3 maps the model errors observed under spatial (leave-one-group-out) cross-validation. The maps illustrate where larger errors emerge when the local spatial process is held out. As anticipated, the most substantial errors concentrate in hotspot locations, underscoring the importance of incorporating spatial dynamics into the modeling process.

# calculate errors by NEIGHBORHOOD (each LOGO-CV fold is one neighborhood,
# so MAE and SD_MAE here both reduce to the absolute value of that fold's mean error)
error_by_reg_and_fold_nb <- 
  reg.ss.spatialCV %>%
    group_by(cvID) %>% 
    summarize(Mean_Error = mean(Prediction - countRobberies, na.rm = T),
              MAE = mean(abs(Mean_Error), na.rm = T),
              SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()

error_by_reg_and_fold_nb %>% 
  arrange(desc(MAE))
error_by_reg_and_fold_nb %>% 
  arrange(MAE)
# A table of MAE and standard deviation MAE by neighborhood

error_by_reg_and_fold_nb_table <- st_drop_geometry(error_by_reg_and_fold_nb)

kable(error_by_reg_and_fold_nb_table, format = "markdown", caption = "Table of MAE and SD MAE by Neighborhood") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  footnote(general_title = "\n",
           general = "Table 3.1")
Table of MAE and SD MAE by Neighborhood
| cvID | Mean_Error | MAE | SD_MAE |
|---|---|---|---|
| Albany Park | -0.0970174 | 0.0970174 | 0.0970174 |
| Andersonville | 0.4116973 | 0.4116973 | 0.4116973 |
| Archer Heights | -0.0650516 | 0.0650516 | 0.0650516 |
| Armour Square | 0.0087052 | 0.0087052 | 0.0087052 |
| Ashburn | 1.1649373 | 1.1649373 | 1.1649373 |
| Auburn Gresham | 0.1818266 | 0.1818266 | 0.1818266 |
| Austin | -0.7085945 | 0.7085945 | 0.7085945 |
| Avalon Park | 0.3798193 | 0.3798193 | 0.3798193 |
| Avondale | -0.4646127 | 0.4646127 | 0.4646127 |
| Belmont Cragin | 1.0040417 | 1.0040417 | 1.0040417 |
| Beverly | 0.9633438 | 0.9633438 | 0.9633438 |
| Boystown | -4.3639857 | 4.3639857 | 4.3639857 |
| Bridgeport | 1.1521622 | 1.1521622 | 1.1521622 |
| Brighton Park | 0.1097199 | 0.1097199 | 0.1097199 |
| Bucktown | -0.5701528 | 0.5701528 | 0.5701528 |
| Burnside | 1.3732724 | 1.3732724 | 1.3732724 |
| Calumet Heights | -0.5387842 | 0.5387842 | 0.5387842 |
| Chatham | -1.2379478 | 1.2379478 | 1.2379478 |
| Chicago Lawn | -0.4954622 | 0.4954622 | 0.4954622 |
| Chinatown | -8.1259169 | 8.1259169 | 8.1259169 |
| Clearing | 0.6662123 | 0.6662123 | 0.6662123 |
| Douglas | -0.2909807 | 0.2909807 | 0.2909807 |
| Dunning | 1.1540862 | 1.1540862 | 1.1540862 |
| East Side | 1.3451156 | 1.3451156 | 1.3451156 |
| East Village | 1.2598087 | 1.2598087 | 1.2598087 |
| Edgewater | -0.6415877 | 0.6415877 | 0.6415877 |
| Edison Park | 0.0648298 | 0.0648298 | 0.0648298 |
| Englewood | -0.4399830 | 0.4399830 | 0.4399830 |
| Fuller Park | 0.4217365 | 0.4217365 | 0.4217365 |
| Gage Park | -0.2898579 | 0.2898579 | 0.2898579 |
| Galewood | 1.4103275 | 1.4103275 | 1.4103275 |
| Garfield Park | -0.5206677 | 0.5206677 | 0.5206677 |
| Garfield Ridge | 1.0493535 | 1.0493535 | 1.0493535 |
| Gold Coast | 0.3751070 | 0.3751070 | 0.3751070 |
| Grand Boulevard | -1.1491789 | 1.1491789 | 1.1491789 |
| Grand Crossing | -0.3170011 | 0.3170011 | 0.3170011 |
| Grant Park | -1.7558691 | 1.7558691 | 1.7558691 |
| Greektown | -3.7359677 | 3.7359677 | 3.7359677 |
| Hegewisch | 0.8141177 | 0.8141177 | 0.8141177 |
| Hermosa | -0.7136081 | 0.7136081 | 0.7136081 |
| Humboldt Park | -2.6791850 | 2.6791850 | 2.6791850 |
| Hyde Park | -1.8018875 | 1.8018875 | 1.8018875 |
| Irving Park | 0.9010289 | 0.9010289 | 0.9010289 |
| Jackson Park | 0.2359467 | 0.2359467 | 0.2359467 |
| Jefferson Park | 0.7908426 | 0.7908426 | 0.7908426 |
| Kenwood | -0.4875413 | 0.4875413 | 0.4875413 |
| Lake View | -0.9181539 | 0.9181539 | 0.9181539 |
| Lincoln Park | -0.0414952 | 0.0414952 | 0.0414952 |
| Lincoln Square | 0.1120961 | 0.1120961 | 0.1120961 |
| Little Italy, UIC | 0.2636717 | 0.2636717 | 0.2636717 |
| Little Village | -0.3975314 | 0.3975314 | 0.3975314 |
| Logan Square | -1.2286492 | 1.2286492 | 1.2286492 |
| Loop | -3.7467150 | 3.7467150 | 3.7467150 |
| Lower West Side | -0.1207179 | 0.1207179 | 0.1207179 |
| Mckinley Park | 1.5829699 | 1.5829699 | 1.5829699 |
| Millenium Park | 0.3663248 | 0.3663248 | 0.3663248 |
| Montclare | 1.3420575 | 1.3420575 | 1.3420575 |
| Morgan Park | 1.2387910 | 1.2387910 | 1.2387910 |
| Mount Greenwood | 0.2143686 | 0.2143686 | 0.2143686 |
| Museum Campus | 0.2397851 | 0.2397851 | 0.2397851 |
| Near South Side | -1.1818619 | 1.1818619 | 1.1818619 |
| New City | 0.2937455 | 0.2937455 | 0.2937455 |
| North Center | 0.9756040 | 0.9756040 | 0.9756040 |
| North Lawndale | -0.5184459 | 0.5184459 | 0.5184459 |
| North Park | 0.6640912 | 0.6640912 | 0.6640912 |
| Norwood Park | 0.5941376 | 0.5941376 | 0.5941376 |
| Oakland | 0.1800966 | 0.1800966 | 0.1800966 |
| Old Town | 0.1455655 | 0.1455655 | 0.1455655 |
| Portage Park | 0.9387018 | 0.9387018 | 0.9387018 |
| Printers Row | -0.5492857 | 0.5492857 | 0.5492857 |
| Pullman | 0.9548150 | 0.9548150 | 0.9548150 |
| River North | -2.3712690 | 2.3712690 | 2.3712690 |
| Riverdale | 0.3092909 | 0.3092909 | 0.3092909 |
| Rogers Park | -0.9334060 | 0.9334060 | 0.9334060 |
| Roseland | 0.7980137 | 0.7980137 | 0.7980137 |
| Rush & Division | -2.6248167 | 2.6248167 | 2.6248167 |
| Sauganash,Forest Glen | 0.6387870 | 0.6387870 | 0.6387870 |
| Sheffield & DePaul | -0.5835738 | 0.5835738 | 0.5835738 |
| South Chicago | 0.2900940 | 0.2900940 | 0.2900940 |
| South Deering | 0.3952048 | 0.3952048 | 0.3952048 |
| South Shore | -1.6108572 | 1.6108572 | 1.6108572 |
| Streeterville | -3.0441790 | 3.0441790 | 3.0441790 |
| Ukrainian Village | -1.3675526 | 1.3675526 | 1.3675526 |
| United Center | -0.5877284 | 0.5877284 | 0.5877284 |
| Uptown | -0.6988892 | 0.6988892 | 0.6988892 |
| Washington Heights | 0.7792835 | 0.7792835 | 0.7792835 |
| Washington Park | 0.1752635 | 0.1752635 | 0.1752635 |
| West Elsdon | 0.4905013 | 0.4905013 | 0.4905013 |
| West Lawn | 0.3336635 | 0.3336635 | 0.3336635 |
| West Loop | -0.9487562 | 0.9487562 | 0.9487562 |
| West Pullman | 1.3370355 | 1.3370355 | 1.3370355 |
| West Ridge | -0.6960832 | 0.6960832 | 0.6960832 |
| West Town | -1.7372681 | 1.7372681 | 1.7372681 |
| Wicker Park | -2.8386407 | 2.8386407 | 2.8386407 |
| Woodlawn | -0.9301051 | 0.9301051 | 0.9301051 |
| Wrigleyville | -0.9920004 | 0.9920004 | 0.9920004 |

Table 3.1
# 7. A small multiple map of model errors by random k-fold and spatial cross validation.

# change it to long format
error_by_reg_and_fold_long <- tidyr::pivot_longer(
  error_by_reg_and_fold_nb,
  cols = c(Mean_Error, MAE, SD_MAE),
  names_to = "Metric",
  values_to = "Value"
)

# Create a ggplot object with facets
ggplot(error_by_reg_and_fold_long) +
  geom_sf(aes(fill = Value)) +
  scale_fill_viridis(name = "Error") +
  labs(title = "Model Errors by Neighborhood, Small Multiples", caption = "Figure 3.3") +
  facet_wrap(~ Metric) +  # Facet by metric
  theme_void()

## plot histogram of OOF (out of fold) errors
error_by_reg_and_fold_nb %>%
  ggplot(aes(MAE)) + 
    geom_histogram(bins = 30, colour="#E1AF00", fill = "#EBCC2A") +
  scale_x_continuous(breaks = seq(0, 11, by = 1)) + 
    labs(title="Distribution of MAE", subtitle = "LOGO-CV",
         x="Mean Absolute Error", y="Count") 

Examining the racial context depicted in Figure 3.4 reveals the predominant racial demographics across Chicago. The central and southern parts of the city appear predominantly inhabited by Majority Non-White residents, while the northern areas are predominantly occupied by Majority White residents.

tracts22 <- 
  get_acs(geography = "tract", variables = c("B01001_001E","B01001A_001E"), 
          year = 2022, state=17, county=031, geometry=T) %>%
  st_transform('ESRI:102271')  %>% 
  dplyr::select(variable, estimate, GEOID) %>%
  spread(variable, estimate) %>%
  rename(TotalPop = B01001_001,
         NumberWhites = B01001A_001) %>%
  mutate(percentWhite = NumberWhites / TotalPop,
         raceContext = ifelse(percentWhite > .5, "Majority_White", "Majority_Non_White")) %>%
  .[neighborhoods,]

# Map racial context
ggplot(tracts22) +
  geom_sf(aes(fill = raceContext), na.rm = TRUE) +
  scale_fill_manual(values = c("Majority_Non_White" = "#3B9AB2", "Majority_White" = "#E1AF00"), name = "Race") +
  labs(title="Race Context, Chicago - 2022", caption = "Figure 3.4") +
  theme_void()

Understanding the racial context is crucial for assessing potential biases in the model. Table 3.2 compares the mean error of the LOGO-CV regression across racial contexts, obtained by joining the fishnet grid cell centroids to tract boundaries. On average, the model underpredicts risk in majority non-White areas and overpredicts in majority White areas. This error structure is the more favorable one for equity, as it is less likely to direct disproportionate policing toward Black and Brown communities.

reg.summary_table <- reg.summary %>%
  st_centroid() %>%
    st_join(tracts22) %>%
    na.omit() %>%
      st_drop_geometry() %>%
      group_by(Regression, raceContext) %>%
      summarize(mean.Error = mean(Error, na.rm = T)) %>%
      spread(raceContext, mean.Error)

reg.summary_table <- st_drop_geometry(reg.summary_table)

kable(reg.summary_table, format = "markdown", caption = "Table of Mean Error by Racial Context") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F) %>%
  footnote(general_title = "\n",
           general = "Table 3.2")
Table of Mean Error by Racial Context
| Regression | Majority_Non_White | Majority_White |
|---|---|---|
| Spatial LOGO-CV: Spatial Process | -0.0562749 | 0.092254 |

Table 3.2
# error_by_reg_and_fold <- 
#   reg.summary %>%
#     group_by(Regression, cvID) %>% 
#     summarize(Mean_Error = mean(Prediction - countRobberies, na.rm = T),
#               MAE = mean(abs(Mean_Error), na.rm = T),
#               SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
#   ungroup()
# 
# error_by_reg_and_fold_table <- st_drop_geometry(error_by_reg_and_fold)%>%
#   group_by(Regression) %>% 
#     summarize(Mean_MAE = round(mean(MAE), 2),
#               SD_MAE = round(sd(MAE), 2))
# 
# kable(error_by_reg_and_fold_table, format = "markdown", caption = "Table of MAE and SD MAE by Regression") %>%
#   kable_styling("striped", "hover", full_width = F) %>%
#     row_spec(2, color = "black", background = "#FDE725FF") %>%
#     row_spec(4, color = "black", background = "#FDE725FF") %>%
#   footnote(general_title = "\n",
#            general = "Table 3.3")

In this section, we explore whether predictive risk assessments surpass conventional ‘Kernel density’ hotspot mapping, a widely adopted technique by law enforcement agencies worldwide to allocate resources efficiently to high-crime areas. To introduce a temporal dimension, we employ hotspot and risk predictions derived from 2022 robbery data to forecast robbery locations in 2023.

Kernel density operates by placing a smooth curve over each crime point, peaking directly above the point and tapering off within a circular search radius. The density at any given location is the sum of these underlying curves, resulting in higher densities in areas with greater proximity to multiple crime points. A critical aspect of kernel density is its reliance on a global search radius parameter, influencing the scale of analysis. Conceptually, kernel density mirrors ‘predictions’ driven primarily by spatial autocorrelation.
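The mechanics described above can be sketched in base R with hypothetical one-dimensional points: the density at any location is the sum of a smooth bump centered on each event, and the bandwidth (search radius) controls how far each bump reaches.

```r
# Hypothetical events: a small cluster plus one isolated incident
events <- c(2, 2.5, 3, 8)

# Gaussian-kernel density estimate evaluated at locations x
kd <- function(x, events, bandwidth) {
  sapply(x, function(p) sum(dnorm((p - events) / bandwidth)) / bandwidth)
}

grid <- seq(0, 10, by = 0.5)
narrow <- kd(grid, events, bandwidth = 0.5)  # sharp, local peaks
wide   <- kd(grid, events, bandwidth = 2)    # one broad, smoothed surface
```

A narrow radius isolates the cluster as a sharp peak; a wide radius merges it with the outlier into one gentle surface, the same trade-off visible across the three search radii in the maps that follow.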

Figure 3.5 illustrates three Kernel density maps across varying scales, revealing distinct robbery hotspot patterns contingent upon the chosen radius.

# A small multiple map of your risk factors in the fishnet (counts, distance and/or other feature engineering approaches)

# demo of kernel width
rob_ppp <- as.ppp(st_coordinates(robbery), W = st_bbox(final_net))
rob_KD.1000 <- density.ppp(rob_ppp, 1000)
rob_KD.1500 <- density.ppp(rob_ppp, 1500)
rob_KD.2000 <- density.ppp(rob_ppp, 2000)
rob_KD.df <- rbind(
  mutate(data.frame(rasterToPoints(mask(raster(rob_KD.1000), as(neighborhoods, 'Spatial')))), Legend = "1000 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(rob_KD.1500), as(neighborhoods, 'Spatial')))), Legend = "1500 Ft."),
  mutate(data.frame(rasterToPoints(mask(raster(rob_KD.2000), as(neighborhoods, 'Spatial')))), Legend = "2000 Ft.")) 

rob_KD.df$Legend <- factor(rob_KD.df$Legend, levels = c("1000 Ft.", "1500 Ft.", "2000 Ft."))

ggplot(data=rob_KD.df, aes(x=x, y=y)) +
  geom_raster(aes(fill=layer)) + 
  facet_wrap(~Legend) +
  coord_sf(crs=st_crs(final_net)) + 
  scale_fill_viridis(name="Density") +
  labs(title = "Kernel density with 3 different search radii", caption= "Figure 3.5") +
  theme_void()

The following code snippet generates a Kernel density map with a 1000-foot search radius using the spatstat package. The as.ppp function converts the robbery coordinates into a ppp object, and the density function computes the Kernel density. For visualization, the density surface is converted into a data frame and then an sf layer; the points are spatially joined to final_net to calculate the mean density per cell, and the resulting surface is displayed with a random sample of 1,500 robbery points overlaid for context.

# A map of your outcome joined to the fishnet

as.data.frame(rob_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) %>%
   ggplot() +
     geom_sf(aes(fill=value)) +
     geom_sf(data = sample_n(robbery, 1500), size = .5) +
     scale_fill_viridis(name = "Density") +
     labs(title = "Kernel density of 2022 robberies", caption = "Figure 3.6") +
     theme_void()

Next, let's compare the performance of our model against kernel density (KD) on the following year's data. We download the 2023 robberies, then classify both the KD estimates and the model's risk predictions into five risk categories for analysis.
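Before the full pipeline, the binning step used below can be shown in isolation — Fisher-Jenks breaks from the classInt package, applied here to toy values rather than the model output:

```r
# Isolated sketch of the categorization step: Fisher-Jenks natural breaks
# assign each value an integer bin, 5 being the highest-risk group.
# The values are made up for illustration.
library(classInt)

vals <- c(0.1, 0.2, 0.25, 1.1, 1.2, 2.9, 3.0, 5.5, 5.6, 9.9)
brks <- classIntervals(vals, n = 5, style = "fisher")
findCols(brks)  # one bin index (1-5) per value
```

The same pair of calls, classIntervals followed by findCols, produces the Risk_Category column for both the KD surface and the model predictions below.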

robbery23 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2023/xguy-4ndq") %>% 
  filter(Primary.Type == "ROBBERY" & 
         Description == "ARMED - HANDGUN") %>%
  mutate(x = gsub("[()]", "", Location)) %>%
  separate(x,into= c("Y","X"), sep=",") %>%
  mutate(X = as.numeric(X),
         Y = as.numeric(Y)) %>% 
  na.omit %>%
  st_as_sf(coords = c("X", "Y"), crs = 4326, agr = "constant") %>%
  st_transform('ESRI:102271') %>% 
  distinct() %>%
  .[fishnet,]

rob_KDE_sum <- as.data.frame(rob_KD.1000) %>%
  st_as_sf(coords = c("x", "y"), crs = st_crs(final_net)) %>%
  aggregate(., final_net, mean) 
kde_breaks <- classIntervals(rob_KDE_sum$value, 
                             n = 5, "fisher")
rob_KDE_sf <- rob_KDE_sum %>%
  mutate(label = "Kernel Density",
         Risk_Category = classInt::findCols(kde_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery23) %>% mutate(robCount = 1), ., sum) %>%
    mutate(robCount = replace_na(robCount, 0))) %>%
  dplyr::select(label, Risk_Category, robCount)

ml_breaks <- classIntervals(reg.ss.spatialCV$Prediction, 
                             n = 5, "fisher")
rob_risk_sf <-
  reg.ss.spatialCV %>%
  mutate(label = "Risk Predictions",
         Risk_Category = classInt::findCols(ml_breaks),
         Risk_Category = case_when(
           Risk_Category == 5 ~ "5th",
           Risk_Category == 4 ~ "4th",
           Risk_Category == 3 ~ "3rd",
           Risk_Category == 2 ~ "2nd",
           Risk_Category == 1 ~ "1st")) %>%
  cbind(
    aggregate(
      dplyr::select(robbery23) %>% mutate(robCount = 1), ., sum) %>%
      mutate(robCount = replace_na(robCount, 0))) %>%
  dplyr::select(label,Risk_Category, robCount)

High-risk classifications with minimal observed robberies in 2023 could indicate either latent risk or a model that is inadequately calibrated. This uncertainty underscores the challenge of evaluating the accuracy of geospatial risk models. Incorporating additional features or conducting further feature engineering could potentially improve model performance.

# The map comparing kernel density to risk predictions for the next year’s crime.

rbind(rob_KDE_sf, rob_risk_sf) %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category, -geometry) %>%
  ggplot() +
    geom_sf(aes(fill = Risk_Category), colour = NA) +
    geom_sf(data = sample_n(robbery23, 3000), size = .5, colour = "black") +
    facet_wrap(~label) +
    scale_fill_viridis(discrete = TRUE) +
    labs(title="Comparison of Kernel Density and Risk Predictions",
         subtitle="2022 robbery risk predictions; 2023 robberies",
         caption = "Figure 3.7") +
    mapTheme(title_size = 14)

Lastly, Figure 3.8 below presents a bar chart of 2023 robbery rates across risk categories and model types. A well-calibrated model should capture a larger share of 2023 robberies in the highest risk category than kernel density does.

The risk prediction model slightly outperforms kernel density in the two highest risk categories, indicating that even this straightforward model offers some utility over the conventional hotspot approach.
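The mechanics of that comparison can be sketched with toy counts — the numbers here are hypothetical, not the Chicago results: within each model, the robbery counts per risk category are normalized to shares of that model's test-set crimes.

```r
# Toy illustration of the Figure 3.8 metric (counts are invented):
# convert per-category robbery counts into within-model shares.
toy <- data.frame(
  label = rep(c("Kernel Density", "Risk Predictions"), each = 2),
  Risk_Category = rep(c("4th", "5th"), 2),
  robCount = c(30, 40, 25, 50)
)
# ave() applies the share calculation separately within each model
toy$pct <- ave(toy$robCount, toy$label, FUN = function(x) x / sum(x))
toy
# In this made-up example, Risk Predictions concentrates 50/75 of its
# crimes in the 5th category, versus 40/70 for Kernel Density.
```

The bar plot below performs the same normalization on the real robCount values, grouping by label before dividing by each model's total.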

# The bar plot making this comparison.
# Two paragraphs on why or why not you would recommend your algorithm be put into production.

rbind(rob_KDE_sf, rob_risk_sf) %>%
  st_drop_geometry() %>%
  na.omit() %>%
  gather(Variable, Value, -label, -Risk_Category) %>%
  group_by(label, Risk_Category) %>%
  summarize(countRobberies = sum(Value)) %>%
  ungroup() %>%
  group_by(label) %>%
  mutate(Pcnt_of_test_set_crimes = countRobberies / sum(countRobberies)) %>%
    ggplot(aes(Risk_Category,Pcnt_of_test_set_crimes)) +
      geom_bar(aes(fill=label), position="dodge", stat="identity") +
      scale_fill_manual(values = c("Kernel Density" = "#3B9AB2", "Risk Predictions" = "#E1AF00"), name = "Model") +
      labs(title = "Risk prediction vs. Kernel density, 2023 robberies",
           y = "% of Test Set Robberies (per model)",
           x = "Risk Category",
           caption = "Figure 3.8") +
      theme_bw() +
      theme(axis.text.x = element_text(angle = 45, vjust = 0.5))

4. Conclusion

The study underscores the importance of developing predictive policing models that effectively address selection bias and systemic inequalities. By leveraging innovative features such as data on abandoned buildings and street lighting, the model aims to enhance accuracy and generalizability while mitigating biases ingrained in the data. Through iterative refinement and rigorous evaluation, the study demonstrates the potential of predictive policing to optimize resource allocation and crime prevention strategies. However, challenges such as temporal variations and data quality biases highlight the ongoing need for methodological improvements and comprehensive data integration. Overall, the study contributes to the dialogue on responsible predictive policing practices, emphasizing the importance of ethical considerations and equitable outcomes.

In summary, this study makes a concrete contribution while acknowledging clear limitations, and further research is needed to advance predictive policing while ensuring fairness and transparency in its implementation.